So, in the first workshop we introduced you to R and RStudio, and taught you the basic ways in which you can use R. To quickly refresh, we covered:
Inf, NaN, NA
In this practical we are going to take a bit of a sideways step - learning skills which are parallel to R but not actually to do with R itself. These will be critical skills for ensuring you can easily create widely shareable open science projects, and will also be used during your projects later in the year.
At the end of the last session I set homework to do some basic calculations using R. Here is my code to run those calculations:
## create a vector of random numbers
ran_50 <- runif(100, 0, 50)
##order them from smallest to largest
sort_ran_50 <- sort(ran_50)
##write the function
my_fun <- function(x){
##subtract log10(x) from x
y <- x - log10(x)
##return the new vector
return(y)
}
##run the function on your random numbers
new_data <- my_fun(sort_ran_50)
##calculate mean, sd, and se:
##calculate mean
mean_dat <- mean(new_data)
##calculate SD
sd_dat <- sd(new_data)
##the function for se
se <- function(x)sd(x)/sqrt(length(x))
se_dat <- se(new_data)
##results
results <- c("mean" = mean_dat,
"sd" = sd_dat,
"se" = se_dat)
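A quick way to check a function like my_fun is to run it on values whose answers can be worked out by hand. The sketch below uses the inputs 1, 10 and 100 (chosen purely for illustration, they are not part of the homework), whose base-10 logarithms are 0, 1 and 2:

```r
##the homework function again, so this block runs on its own
my_fun <- function(x){
  ##subtract log10(x) from x
  y <- x - log10(x)
  ##return the new vector
  return(y)
}

##illustrative inputs with easy logarithms (log10 gives 0, 1 and 2)
check_vals <- c(1, 10, 100)

##worked out by hand: 1 - 0, 10 - 1 and 100 - 2, so we expect 1, 9 and 98
checked <- my_fun(check_vals)
checked
```

If the values printed match the ones you worked out by hand, you can be more confident the function is doing what you intended before running it on random numbers.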
Now, on to this week’s work.
Github is an online repository aimed at making two things easy: (1) sharing/collaborating on code and (2) version control. Github is actually a front end to Git - the underlying code that makes the sharing/collaborating and version control actually work. Github gives us a nice user-friendly front end where we can do all the important stuff Git does without having to learn the Git commands ourselves. Github is widely used to assist in writing code, particularly when there are many different people involved in writing complex scripts which are hard to keep track of. In essence it will save your R folders from looking like this:
The figure to the right gives you an approximate idea of how Git and Github work. Let’s work from the bottom up.
Github works with what are called repositories. These are folders on your computer which contain code, data, and README files which describe what the repository contains. These files can be organised into subfolders (in the diagram to the right we have two subfolders in Repository 1 called Data and R code). You will end up with lots of these repositories, usually one for each project you are working on.
The code, data, and other files in these repositories are tracked and catalogued by a GUI for Github called Github Desktop. As we said earlier Github is an online repository, and this desktop GUI allows us to sync our desktop repositories with the stored versions of these repositories online.
These repositories are stored under your profile at Github.com - you can see mine here. A repository can be marked as private (only you and invited collaborators can work on it) or public (anyone can download your code and data).
Github is actually a front end for Git, which does all the heavy lifting of tracking the changes to code, data, etc., which form the backbone of what Github is useful for.
Git/Github work in much the same way as track-changes does in Microsoft Word (if you have used that). You will make an initial copy of your repository (including any code which is already in there) and you will sync this to Github.com. Then, whenever you make changes to the documents in your repository, the Github Desktop GUI will track these changes and, when you ask it to, upload them to Github.com.
Today you will learn how to set up a Github account and start a repository, how to see what changes have been made to the files in your repository, and how to branch and clone repositories to get access to other people’s code and data.
You will then be adding code and documents to your Github repository as we go through this course, culminating in the submission of your projects through Github.
The first thing to do is make a Github account - this will give you access to all the magical powers of Github. And it’s free! Go to the Github homepage and you should see a big box on the right hand side saying “Sign up”. Alternatively try here. Go through the steps and when you have the option select “Individual - Free” in the account type page. When you have verified your account you will be asked if you want to create your first repository - don’t do this, instead either quit your browser window or navigate to https://github.com/.
Next you’ll need to download and install Github desktop, selecting the correct operating system. Once this is installed you will need to sign into your Github account (through the prompted window). Great, now we can start using Github.
There are three ways of making a Github repository. The simplest and best (in my opinion) is to create one via Github Desktop.
NOTE - Github repositories have some issues when created in Google Drive, OneDrive or Dropbox on your computer. The solution to this is to ensure that all repositories are created on your local computer and are not being synced via any of these services.
To do this navigate to File -> New repository:
Which will bring up the following box:
Here we will tell Github some basic information about our repository. First off we need to name it. I will leave this up to you, but consider that this is going to be the repository you are using for this R course over the next few weeks (i.e. name it something meaningful, not “Repository 1”!). Then add a description - this doesn’t need to be lots of detail, just a few lines on what this repository is going to be used for.
Next choose a place to make this repository. Where this will be depends on how you organise your files on your computer - I have a Work folder inside which I keep folders for each of my projects. Because Github recommends keeping repositories small (ideally under 1GB) and blocks individual files larger than 100MB, it sometimes isn’t appropriate to keep all of the files for a project in your repository. So in the example below you might just keep the R code in a Github repository, whilst the rest is kept on your local drive (but is obviously backed up somewhere else!).
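If you are unsure whether a project folder is small enough to keep in a repository, you can add up its file sizes from R. This is only a rough sketch: the folder and the dummy file below are made up in a temporary directory so that the example runs anywhere, and you would swap in the path to your own project folder.

```r
##a throwaway folder standing in for a project (hypothetical contents)
project_dir <- file.path(tempdir(), "size-example")
dir.create(project_dir, showWarnings = FALSE)

##write a dummy file of exactly 1024 bytes so there is something to measure
writeBin(raw(1024), file.path(project_dir, "dummy.bin"))

##list every file in the folder and its subfolders
project_files <- list.files(project_dir, recursive = TRUE, full.names = TRUE)

##total size in bytes, then converted to megabytes
total_bytes <- sum(file.size(project_files))
total_mb <- total_bytes / 1024^2
total_mb
```

Large data files are usually what push a project over the limit, which is why keeping only the code in the repository is often enough.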
So, decide where you are going to place your repository using the “Local Path” option
Select the “Initialize this repository with a README” box - we will go into that more in a minute.
The Git Ignore option tells Git which sorts of files to leave out of this repository. Select R from the drop down list. We do this so that Github doesn’t include any temporary files R creates whilst running code in the upload to our online Github repository.
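To get a feel for what the ignore list does, the sketch below mimics it in R. It assumes that .Rhistory, .RData and .Rproj.user are among the names the R template excludes (check the .gitignore file Github creates for the real list), it uses a made-up folder in a temporary directory, and it only matches exact file names, whereas Git itself understands more flexible patterns.

```r
##Assumption: some of the temporary files the R ignore template is meant to exclude
ignored_names <- c(".Rhistory", ".RData", ".Rproj.user")

##a throwaway folder standing in for a repository (hypothetical contents)
repo <- file.path(tempdir(), "ignore-example")
dir.create(repo, showWarnings = FALSE)
file.create(file.path(repo, c("README.md", "analysis.R", ".Rhistory", ".RData")))

##list everything in the folder, including hidden files
all_files <- list.files(repo, all.files = TRUE, no.. = TRUE)

##files that would be tracked, and files that would be skipped
tracked <- setdiff(all_files, ignored_names)
skipped <- intersect(all_files, ignored_names)
tracked
skipped
```

Only the README and the R script end up in the tracked list, which is the behaviour we want from the upload to Github.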
We can ignore the “License” bit - but just know that if you are writing packages or libraries in the future it will be worth looking into these and deciding what sort of license you want to apply to your code (for an explanation of the different types see here).
So for me, the box now looks like this:
Go ahead and click Create Repository!
If you now navigate to the Local Path you specified when you initialised the repository you should see that Github has created a folder on your computer. If you open that folder you will see a README.md file.
README.md
Let’s briefly cover the README.md file that Github created. If you open this (using a text editor of some description - I use TextEdit on Mac OS) you will see that the README file has been initialised with the title and description you put in during the repository initialisation earlier. Mine are:
# Bioinformatics test repository
An example repository for the Bioinformatics masters course
This README file is (as the name suggests) a file you should read when you are interested in what is in the repository. It gives you a place to give details about the data and methods, links to the publication/s which this repository supports, the author/s of the code, owners of the data, help guides for using the methods, etc etc.
The # denotes a title - one # is the largest title format, ## is the next largest title, and so on.
For now add a second-level title (##) and your name to the README.md (ensuring there is at least 1 empty line below your name in the .md file), then save and close the file (we will pick this up again later).
# Bioinformatics test repository
An example repository for the Bioinformatics masters course
## Author
Chris Clements
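To make the title levels concrete, here is a small sketch in R that reads the lines of the README above and counts the # symbols at the start of each title. The README text is typed in as a character vector rather than read from disk, and the object names are just for illustration.

```r
##the README from above, one element per line
readme_lines <- c("# Bioinformatics test repository",
                  "An example repository for the Bioinformatics masters course",
                  "## Author",
                  "Chris Clements",
                  "")

##a title line starts with one or more # followed by a space
is_heading <- grepl("^#+ ", readme_lines)

##the title level is the number of # symbols (1 = largest)
heading_level <- nchar(sub("^(#+) .*$", "\\1", readme_lines[is_heading]))

##the title text with the # symbols stripped off
heading_text <- sub("^#+ ", "", readme_lines[is_heading])

data.frame(level = heading_level, title = heading_text)
```

The repository name comes out as level 1 and Author as level 2, matching how Github displays them.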
So far you have created a repository, but it is not synced to your online Github repository. If you open Github Desktop you will see something which looks like this (it might vary a bit between Windows and Mac):
This is where we will interface between your changes to your files and Github online. There are a number of things we need to know.
So, what is Committing?
A “commit” is basically Github’s way of saying “save”. When you make a “commit” you are saying that you want to save the project at this point with any associated changes to the files within the repository. This will become critical later as we start to think about version control - the ability to go back to a previous version of the repository and start again from there. There is a balance to be considered regarding how often you commit - if you commit every small change it is very difficult to find the point you want to go back to; if you don’t commit often enough, there may be no saved point close to the one you want to return to. Along with the commit we add comments to say what we have done since the last time we committed. If we look again at the following:
We can see that number 6 is both where we add these comments and where we make the commit. It’s given us a suggested title for this commit, but let’s write in “Initial commit” to signify this is the first time we have made changes to this repository, and in the description you can make some notes saying that this is the initial commit and you have made changes to the README.md.
Once you have done this, go ahead and click “Commit to master”. Note the bold: this signifies we are committing to the master branch (number 2 in the figure, more on branches below).
You will see that once you have clicked “Commit to master” then all the information displayed disappears and you are left with a sign saying “No local changes”. This is telling us that there have been no changes to any of the files in our repository since we last committed.
You will also see below the “No local changes” that there is a message saying that our repository is only available on our local machine, with a prompt to “Publish your repository to GitHub”. This highlights an important point - a “commit” is local. To make these changes appear on our Github profile online we need to “push” these changes to Github.
“Pushing” is Github speak for making sure our local changes to our repositories are published to our online repository. Because we haven’t pushed any changes before, in this case “pushing” to Github will also publish our repository. After this first publish, the button will read “Push origin” instead. For now, go ahead and click “Publish your repository to GitHub”. You will see a box appear giving you some options:
Here we can change the name of our repository and description, and we have the option to keep this code private (make sure this is selected - you can change a repository’s visibility later in its settings on Github.com, but once code has been public other people may already have copied it!).
You can ignore the Organization section too as this isn’t relevant (if you want to know more about them then have a read here).
Once you are happy, click “Publish Repository”.
So we have published our Repository. Let’s login to Github online and see what it looks like there.
Once you have logged in navigate to your profile, and then click on “Repositories”. You should now see the repository you just published. If you click on the repository you will then see the following:
There is a list of the files in your repository (not much at the moment), and the README.md is conveniently displayed as a formatted document below.
Some information on the history of your repository is displayed at the top - the number of commits and number of branches (see below) are the ones most relevant to you. There are also some useful buttons it is worth noting at this point - the “Clone or download” button and “New pull request”. More on those later too.
Branching and version control are not the same thing, but they are related. We mentioned version control earlier on: it’s like track changes in Microsoft Word - you can see the changes which have been made, and you can roll back to a previous version of the document if what you are doing didn’t work, or you need to use a new method or technique. Branching is a way of having multiple parallel versions of a repository which you can work on independently, and then merge the changes back together to form a single repository again. Let’s cover branching first.
When you make a Github repository you create the Master branch - this is the main working branch of the repository. You can make and commit changes to this master branch as we did above.
However, let’s imagine we want to carry out a statistical analysis. We think we want to use a generalised linear model (GLM) on our data but we aren’t 100% sure. Instead of doing this coding and committing these changes to the Master branch - and then having to undo them at some later date - a better option is to create a new branch. This branch will run in parallel to the master branch and allows us to test out our analysis without being 100% sure it will work.
Take the diagram above. We have our original master branch, and we have branched off from this master (1) and made a commit to that branch (2 - maybe some data tidying and sorting before our analysis). Then we think about our analyses - we aren’t sure if a GLM or a generalised linear mixed-effect model (GLMM) framework is going to be most appropriate. So we make another branch (3) and we try out the GLM approach. We make a commit (4) but realise that this isn’t likely to be the best way to analyse these data. We could delete all the GLM work we have just done and start over again with the GLMM work, but it’s better to not throw away all our GLM work in case we want to come back to it later. Instead we can just revert back to our original branch (at 3), and develop the analysis there using the GLMM method instead. This leaves our GLM branch hanging by itself, but that’s fine - we have a copy of all the code if we ever want to go back to it. Once we are happy that the GLMM approach is the best option and the analysis is finished, we can then merge this code back in with the master branch (5).
In the above example we have implicitly covered version control - it’s the ability to move back to an earlier version of our repository (from 4 to 3). In that example we moved back to an earlier position on a different branch, but you can do the same thing along a single branch - moving back to an earlier version of the code on the same branch. We can roll back to any point at which we have made a “Commit” - so choosing when and how frequently to make commits is really important, as is keeping good notes to allow you to find exactly which version of the code you want to roll back to. So make your commit notes useful!
Making a branch is easy using Github desktop:
Go ahead and make a branch - pick a name (for this I am just going to use test-branch) and hit Create. Nothing much appears to have changed, but if you look at the Current Branch drop down menu you’ll see that you now have two branches: the master and the test-branch:
We are currently in test-branch and you can return to the master branch by simply clicking on it (but don’t, let’s stay in the test-branch for now). On our earlier schematic we are now between 1 and 2:
To see how useful branches can be let’s make some changes to our repository. In your repository create a folder called “R code”, then open RStudio and make a new R script. Then let’s write a simple R script:
## this command clears R's memory.
## it will delete all of the loaded-in data, and any objects and code which have been run.
## it's useful to have this at the beginning of all your
## R projects to make sure anything saved in R's memory isn't affecting your code or results
## (I start almost all my R scripts with this)
rm(list = ls())
##make a vector of random numbers from a normal distribution:
normal_nums <- rnorm(n = 100, mean = 10, sd = 2)
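If you want to see what rm(list = ls()) does without wiping your own session, the sketch below runs the same call on a separate scratch environment. The environment and the object names in it are made up for illustration. Note that this command removes objects only: it does not unload any packages you have loaded.

```r
##a separate environment standing in for R's memory (so nothing real is deleted)
scratch <- new.env()
assign("old_data", c(1, 2, 3), envir = scratch)
assign("old_result", 42, envir = scratch)

##objects present before clearing
before <- ls(scratch)

##the same idea as rm(list = ls()), but pointed at the scratch environment
rm(list = ls(scratch), envir = scratch)

##objects present after clearing: none
after <- ls(scratch)

before
after
```

The first listing shows the two objects and the second is empty, which is the clean slate we want at the top of a script.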
Then save this into our new R code folder. If you go back to Github desktop you will see these changes (addition of the R code) have appeared in your repository. Go ahead and commit those (with notes!) and Publish branch. If you look at your online repository you will now see you have 2 branches:
If you select the test-branch you’ll see your new folder has been synced - and inside it is your R code. Fantastic! We are now at point 2:
So we know we are working on our branch, but what is going on on our master branch? The short answer is nothing. We can view the other branches of our repository at any time from the drop down menu in Github desktop - try it now by selecting the master branch from the Current branch drop down menu. Now, go back to your repository folder on your local computer and you’ll see that everything you had put into the repository on the branch (the R code folder and the code inside it) has disappeared! This is because we are back at point 1 on our schematic. You can switch between any one of your branches at any time using this menu - Github stores them all on your computer at the same time, and just allows you to view/edit them when you select them in Github desktop. Smart!
Right, let’s go back to our test-branch and work some more on our R code. Let’s write a simple function, one we wrote last week. Add the following to the R script you are working on in your repository:
##Calculate standard error
se <- function(x){
##standard deviation divided by the square root of the number of observations
std_er<-sd(x)/sqrt(length(x))
##return the answer
return(std_er)
}
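Before committing a function it is worth checking it on numbers you can work out by hand. As a sketch, the values 1 to 5 (picked only because the arithmetic is easy) have a variance of 2.5, so the standard error should be sqrt(2.5)/sqrt(5), which is sqrt(0.5) or roughly 0.707:

```r
##the function from above, repeated so this block runs on its own
se <- function(x){
  ##standard deviation divided by the square root of the number of observations
  std_er <- sd(x)/sqrt(length(x))
  ##return the answer
  return(std_er)
}

##illustrative values with an easy-to-check answer
test_vals <- c(1, 2, 3, 4, 5)

##compare the function's answer to the hand calculation
se_check <- se(test_vals)
all.equal(se_check, sqrt(0.5))
```
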
Save that and commit it.
Then let’s modify the function above so that it returns both the standard error and the mean, and then try that out to see if it works:
##Calculate standard error
se_mean <- function(x){
##standard deviation divided by the square root of the number of observations
std_er<-sd(x)/sqrt(length(x))
##return the answers
return(c("mean" = mean(x), "se" = std_er))
}
##run it on our vector
se_mean(normal_nums)
If you run your script you should see that it works fine, and you get a vector returned looking something like this (your exact values will differ because normal_nums is random):
mean se
10.3460555 0.2156303
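Because rnorm() draws new numbers every time, the values you see will not match the ones printed here. If you want a script to give the same answer on every run (useful when you are comparing versions of code in Github), one option is to set the random seed first. This is a sketch rather than part of the practical, and the seed value 123 is an arbitrary choice:

```r
##the function from above, repeated so this block runs on its own
se_mean <- function(x){
  ##standard deviation divided by the square root of the number of observations
  std_er <- sd(x)/sqrt(length(x))
  ##return the answers
  return(c("mean" = mean(x), "se" = std_er))
}

##fix the seed (arbitrary value), then draw the numbers and summarise them
set.seed(123)
first_run <- se_mean(rnorm(n = 100, mean = 10, sd = 2))

##reset to the same seed and repeat: the same numbers are drawn again
set.seed(123)
second_run <- se_mean(rnorm(n = 100, mean = 10, sd = 2))

##the two runs match exactly
identical(first_run, second_run)
```
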
Great, let’s save that and commit as well. You can push it to the origin too to make sure your Github online repository is at the same stage. We are now at somewhere between points 3 and 5 on our schematic:
Right, let’s imagine that we tried the se_mean function route but have decided that this isn’t what we want to do: we want to go back to the original se function instead. We need to do a reset of the file in our repository. Remember, doing this means we still have the se_mean version in our history in case we ever need to go back to it or access some of the code. There are a whole host of ways to do this depending on exactly what you want to do. We will cover 2 quick and simple ones which are fine for this purpose.
The easiest way is to navigate to your repository on github and get the version of the file you want, download it, and save it over the file you want to reset in your local repository: